Full Analysis of Philippine Food Prices, 2007–2025

Aims:

  1. Compare average prices across different regions for the same food item and year.
  2. Examine how prices of a specific food item have changed over time within a region.
  3. Identify regional price variations for different food items.
  4. Predict future food prices based on historical data.
Code
from itables import init_notebook_mode
import pandas as pd
import numpy as np
from scipy.stats import shapiro
from scipy.stats import kruskal
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import levene
from ipywidgets import interact
import dash
from dash import dcc, html, Input, Output
import plotly.express as px
import plotly.graph_objects as go
import scikit_posthocs as sp
import ipywidgets as widgets
from IPython.display import display, clear_output
from pmdarima import auto_arima
import warnings
import plotly.io as pio
import shiny
from shiny import ui, render, reactive
import nest_asyncio
Code
init_notebook_mode(all_interactive=True)
# Open the file that was already cleaned, analyzed, and saved.
file_path = "../Data/Processed/eda_results.csv"
df = pd.read_csv(file_path)

Inspect the data

Code
# View the first few rows of the dataset
df.head()
df
Region Province Food_Items year mean median Mode Variance Standard Deviation IQR
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
Code
# View the columns and data types of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91200 entries, 0 to 91199
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Region              91200 non-null  object 
 1   Province            91200 non-null  object 
 2   Food_Items          91200 non-null  object 
 3   year                91200 non-null  int64  
 4   mean                78875 non-null  float64
 5   median              78875 non-null  float64
 6   Mode                36432 non-null  float64
 7   Variance            76125 non-null  float64
 8   Standard Deviation  76125 non-null  float64
 9   IQR                 78875 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 7.0+ MB
Code
# Summary statistics of the dataset (already covered in the EDA report, so shown here only for reference)
df.describe()
year mean median Mode Variance Standard Deviation IQR
Code
#check the shape of the dataset
df.shape
(91200, 10)
Code
# Get the earliest and latest years in the dataset
print(df['year'].agg(['min', 'max']))
min    2007
max    2025
Name: year, dtype: int64

For comparing average prices across regions for the same food item and year, the closing (c_) price is generally the best one to use. Here’s why:

  1. Consistency: The closing price is the most representative of the market’s consensus for the period, since it reflects supply and demand dynamics over the whole trading session.
  2. Standard Usage: The closing price is the most widely used price in market data, so analyses are typically built around it.
  3. Simplicity: It filters out intra-day noise and focuses on the end-of-session price, which is more relevant for long-term comparisons.
Code
# Filter the dataframe to keep only rows where 'Food_Items' starts with 'c_' (closing prices)
df_filtered = df[df['Food_Items'].str.startswith('c_')]

df_filtered
Region Province Food_Items year mean median Mode Variance Standard Deviation IQR
Code
df_filtered.loc[:, 'Food_Items'] = df_filtered['Food_Items'].str.replace('^c_', '', regex=True)
df_filtered
Region Province Food_Items year mean median Mode Variance Standard Deviation IQR
Code
# Create figure
fig = go.Figure()

# Add traces for each food item and stat
food_items = df_filtered['Food_Items'].unique()
stats = ['mean', 'median']

for food in food_items:
    for stat in stats:
        df_food = df_filtered[df_filtered['Food_Items'] == food]
        for region in df_food['Region'].unique():
            df_region = df_food[df_food['Region'] == region]
            fig.add_trace(go.Scatter(
                x=df_region['year'],
                y=df_region[stat],
                mode='lines+markers',
                name=f"{food} - {region} ({stat})",
                visible=False
            ))

# Set initial visible traces (first food item, 'mean' stat)
initial_food = food_items[0]
initial_stat = 'mean'
for trace in fig.data:
    if initial_food in trace.name and initial_stat in trace.name:
        trace.visible = True

# Create separate dropdown buttons for food items and statistics
food_buttons = []
stat_buttons = []

for food in food_items:
    visibility = [food in trace.name and initial_stat in trace.name for trace in fig.data]
    food_buttons.append({
        'label': food,
        'method': 'update',
        'args': [{'visible': visibility},
                 {'title': f"{initial_stat.capitalize()} for {food} by Region and Year"}]
    })

for stat in stats:
    visibility = [initial_food in trace.name and stat in trace.name for trace in fig.data]
    stat_buttons.append({
        'label': stat,
        'method': 'update',
        'args': [{'visible': visibility},
                 {'yaxis': {'title': stat.capitalize()}},
                 {'title': f"{stat.capitalize()} for {initial_food} by Region and Year"}]
    })

# Update layout with separate dropdown menus
fig.update_layout(
    title=f"{initial_stat.capitalize()} for {initial_food} by Region and Year",
    xaxis_title="Year",
    yaxis_title=initial_stat.capitalize(),
    template="plotly_white",
    updatemenus=[
        {
            'buttons': food_buttons,
            'direction': 'down',
            'showactive': True,
            'x': 0,
            'xanchor': 'left',
            'y': 1.15,
            'yanchor': 'top',
        },
        {
            'buttons': stat_buttons,
            'direction': 'down',
            'showactive': True,
            'x': 0.26,
            'xanchor': 'left',
            'y': 1.15,
            'yanchor': 'top',
        }
    ]
)

# Show figure
fig.show()

The graph shows the mean (average) or median price of different food items over time for different regions in the Philippines.

Key Observations

  1. Trend - Generally, the price of all food items has been increasing over the years across most regions. There are fluctuations and periods of stability for some items, but the overall trend is upward.
  2. Regional Differences - There is significant variation in the price of different food items across regions. Some regions consistently have higher prices than others, and price fluctuations also vary between regions, with some experiencing more volatile prices than others.

Factors such as weather patterns, government policies, agricultural changes, food transportation, and global market trends can influence these prices.

To identify which regions had the highest and lowest prices each year for every food item, the max and min will be calculated. The range (max minus min) will then show how wide the gap between the highest and lowest prices is.
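That computation can be sketched on toy data in the long format used later in this notebook; the column names (Region, year, Food_Items, Price) match the melted table built below, but the values here are made up:

```python
import pandas as pd

# Toy long-format data standing in for the melted price table
df_long = pd.DataFrame({
    "Region": ["NCR", "Region I", "NCR", "Region I"],
    "year": [2020, 2020, 2021, 2021],
    "Food_Items": ["rice"] * 4,
    "Price": [45.0, 38.0, 47.5, 40.0],
})

# Highest and lowest price per food item per year, plus the range between them
yearly_range = (
    df_long.groupby(["Food_Items", "year"])["Price"]
    .agg(["min", "max"])
    .assign(range=lambda s: s["max"] - s["min"])
    .reset_index()
)
```

For 2020 this yields a min of 38.0, a max of 45.0, and a range of 7.0.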

Code
file_path = "../Data/Interim/cleaned_food_prices.csv"
df_range = pd.read_csv(file_path)

#dropping columns related to food price index
df_nofpi_range = df_range.drop(columns=['o_food_price_index', 'h_food_price_index', 'l_food_price_index', 'c_food_price_index', 'inflation_food_price_index', 'trust_food_price_index'])

# Convert 'Date' column to datetime format
df_nofpi_range['Date'] = pd.to_datetime(df_nofpi_range['Date'])

#dropping columns related to inflation
df_noinf_range = df_nofpi_range.drop(columns=['inflation_beans','inflation_cabbage', 'inflation_carrots', 'inflation_eggs', 'inflation_garlic', 'inflation_meat_beef_chops', 'inflation_meat_chicken_whole', 'inflation_meat_pork', 'inflation_onions', 'inflation_potatoes', 'inflation_rice', 'inflation_tomatoes'])

#dropping columns related to trust scores
df_cleaned_range = df_noinf_range.drop(columns=['trust_beans','trust_cabbage', 'trust_carrots', 'trust_eggs', 'trust_garlic', 'trust_meat_beef_chops', 'trust_meat_chicken_whole', 'trust_meat_pork', 'trust_onions', 'trust_potatoes', 'trust_rice', 'trust_tomatoes'])

#dropping unneeded columns
df_unneeded_range = df_cleaned_range.drop(columns=['country', 'City', 'lat', 'lon', 'Province', 'Date', 'month'])

# Reshaping from wide to long format (keeping Region and year as identifiers)
df_range = df_unneeded_range.melt(id_vars=['Region', 'year'], var_name='Food_Items', value_name='Price')

df_range_filtered = df_range[df_range['Food_Items'].str.startswith('c_')]

df_range_filtered.loc[:, 'Food_Items'] = df_range_filtered['Food_Items'].str.replace('^c_', '', regex=True)

df_range_filtered
Region year Food_Items Price
Code

# Dropdown for selecting food items
food_items = df_range_filtered['Food_Items'].unique()
dropdown = widgets.Dropdown(options=food_items, description="Food Item:")

# Output widget for displaying the plot
output = widgets.Output()

def update_plot(selected_food_item):
    with output:
        clear_output(wait=True)  # Ensure only one graph is displayed
        filtered_dfr = df_range_filtered[df_range_filtered['Food_Items'] == selected_food_item]
        price_stats = filtered_dfr.groupby('year')['Price'].agg(['min', 'max', 'mean']).reset_index()
        
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=price_stats['year'], y=price_stats['min'], mode='lines+markers', name='Min Price'))
        fig.add_trace(go.Scatter(x=price_stats['year'], y=price_stats['max'], mode='lines+markers', name='Max Price'))
        fig.add_trace(go.Scatter(x=price_stats['year'], y=price_stats['mean'], mode='lines+markers', name='Average Price'))
        
        fig.update_layout(
            title=f'Min, Max, and Average Prices Per Year for {selected_food_item}',
            xaxis_title='Year',
            yaxis_title='Price',
            legend_title='Price Type'
        )
        display(fig)  # Display updated graph

def on_dropdown_change(change):
    update_plot(change['new'])

# Attach event listener
dropdown.observe(on_dropdown_change, names='value')

# Display UI elements once
display(dropdown, output)

# Show initial plot
update_plot(food_items[0])
Code
# Dash app setup
apph = dash.Dash(__name__)

apph.layout = html.Div([
    html.H3("Highest Prices Per Year by Region"),
    
    # Dropdown for selecting food items
    dcc.Dropdown(
        id='food-item-dropdown',
        options=[{'label': item, 'value': item} for item in df_range_filtered['Food_Items'].unique()],
        value=df_range_filtered['Food_Items'].unique()[0],
        placeholder="Select a food item"
    ),
    
    # Graph for displaying highest prices
    dcc.Graph(id='highest-price-graph')
])

# Callback to update the graph based on selected food item
@apph.callback(
    Output('highest-price-graph', 'figure'),
    [Input('food-item-dropdown', 'value')]
)
def update_graph(selected_food_item):
    # Filter data for the selected food item
    filtered_dfr = df_range_filtered[df_range_filtered['Food_Items'] == selected_food_item]
    
    # Find the highest price per year and the corresponding region
    highest_prices = (
        filtered_dfr.loc[filtered_dfr.groupby('year')['Price'].idxmax()]
        .reset_index(drop=True)
    )
    
    # Create the bar chart
    fig = px.bar(
        highest_prices,
        x='year',
        y='Price',
        color='Region',  # Highlight the region in the bar color
        title=f'Highest Prices Per Year for {selected_food_item}',
        labels={'Price': 'Price', 'year': 'year', 'Region': 'Region'}
    )
    
    return fig

# Run the app
if __name__ == '__main__':
    apph.run(debug=True, port=8051, jupyter_mode='inline')  # jupyter_mode requires dash>=2.11
Code
# Dash app setup
appl = dash.Dash(__name__)

appl.layout = html.Div([
    html.H3("Lowest Prices Per Year by Region"),
    
    # Dropdown for selecting food items
    dcc.Dropdown(
        id='food-item-dropdown',
        options=[{'label': item, 'value': item} for item in df_range_filtered['Food_Items'].unique()],
        value=df_range_filtered['Food_Items'].unique()[0],
        placeholder="Select a food item"
    ),
    
    # Graph for displaying lowest prices
    dcc.Graph(id='lowest-price-graph')
])

# Callback to update the graph based on selected food item
@appl.callback(
    Output('lowest-price-graph', 'figure'),
    [Input('food-item-dropdown', 'value')]
)
def update_graph(selected_food_item):
    # Filter data for the selected food item
    filtered_dfr = df_range_filtered[df_range_filtered['Food_Items'] == selected_food_item]
    
    # Find the lowest price per year and the corresponding region
    lowest_prices = (
        filtered_dfr.loc[filtered_dfr.groupby('year')['Price'].idxmin()]
        .reset_index(drop=True)
    )
    
    # Create the bar chart
    fig = px.bar(
        lowest_prices,
        x='year',
        y='Price',
        color='Region',  # Highlight the region in the bar color
        title=f'Lowest Prices Per Year for {selected_food_item}',
        labels={'Price': 'Price', 'year': 'year', 'Region': 'Region'}
    )
    
    return fig

# Run the app on a different port so it does not clash with the first app
if __name__ == '__main__':
    appl.run(debug=True, port=8052, jupyter_mode='inline')  # jupyter_mode requires dash>=2.11
Code
# Dash App Setup
apprange = dash.Dash(__name__)

apprange.layout = html.Div([
    html.H3("Price Range of Food Items per Year Across Regions"),
    
    # Dropdown for selecting food items
    dcc.Dropdown(
        id='food-item-dropdown',
        options=[{'label': item, 'value': item} for item in df_range_filtered['Food_Items'].unique()],
        value=df_range_filtered['Food_Items'].unique()[0],
        placeholder="Select a food item"
    ),
    
    # Graph for displaying the price range (min and max) per year
    dcc.Graph(id='price-range-graph')
])

# Callback to update the graph based on selected food item
@apprange.callback(
    Output('price-range-graph', 'figure'),
    [Input('food-item-dropdown', 'value')]
)
def update_graph(selected_food_item):
    # Filter data for the selected food item
    filtered_df = df_range_filtered[df_range_filtered['Food_Items'] == selected_food_item]
    
    # Get the minimum and maximum price per year
    price_range = (
        filtered_df.groupby('year')['Price']
        .agg(['min', 'max'])
        .reset_index()
    )
    
    # Create the graph (min and max prices as lines)
    fig = px.line(
        price_range, 
        x='year', 
        y=['min', 'max'], 
        title=f'Price Range (Min & Max) Per Year for {selected_food_item} Across Regions',
        labels={'year': 'year', 'value': 'Price'},
        line_shape='linear'
    )
    
    return fig

# Run the app
apprange.run(debug=True, port=8053, jupyter_mode='inline')  # jupyter_mode requires dash>=2.11

Key Observations:

  1. Overall Trend - Both the minimum and maximum prices show a general upward trend over the years, indicating an overall increase in the price of every food item. Several items recorded their sharpest price increases in 2020, likely due to the COVID-19 pandemic and its impact on supply chains.
  2. Price Range - The gap between the minimum and maximum prices for most items widens significantly in 2020 and 2021, suggesting greater variation in prices across regions during those years.

Possible causes:

  1. COVID-19 Pandemic - The pandemic disrupted supply chains and increased demand for certain food items, which could have led to price volatility.
  2. Regional Differences - Differences in the pandemic's impact on regional economies and agricultural production could have contributed to price variations.

To confirm whether regional prices truly differ over the years, hypothesis testing must be done. Please refer to .._testing.ipynb for why the Kruskal-Wallis test was used here.
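As a reminder of what the test reports, here is a minimal Kruskal-Wallis sketch on synthetic prices for three hypothetical regions (not the project's data); a small p-value means at least one region's price distribution differs from the others:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic monthly prices: region_c is shifted well above the other two
region_a = rng.normal(100, 5, 50)
region_b = rng.normal(100, 5, 50)
region_c = rng.normal(120, 5, 50)

h_stat, p_value = kruskal(region_a, region_b, region_c)
significant = p_value < 0.05  # reject the null of equal distributions
```

With a 20-peso shift in one group the test comes out significant, which mirrors the "Significant" rows in the results file below.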

Code
# Open the CSV file with the Kruskal-Wallis test results.
hypothesis_results = "../Data/Processed/hypothesis_testing_result.csv"
hyp_df = pd.read_csv(hypothesis_results)
Code
# View the first few rows of the dataset
print(hyp_df.head())
   Year  Food_Item  H-statistic        p-value           Result
0  2007      beans          NaN            NaN  Not Significant
1  2007    c_beans  1077.368029  2.491117e-218      Significant
2  2007  c_cabbage  1105.762042  2.066749e-224      Significant
3  2007  c_carrots  1093.290901  9.693593e-222      Significant
4  2007     c_eggs   905.101619  1.726764e-181      Significant
Code
# Filter the dataframe to keep only rows where 'Food_Item' starts with 'c_'
hyp_df_filtered = hyp_df[hyp_df['Food_Item'].str.startswith('c_')]

print(hyp_df_filtered)
      Year    Food_Item  H-statistic        p-value       Result
1     2007      c_beans  1077.368029  2.491117e-218  Significant
2     2007    c_cabbage  1105.762042  2.066749e-224  Significant
3     2007    c_carrots  1093.290901  9.693593e-222  Significant
4     2007       c_eggs   905.101619  1.726764e-181  Significant
5     2007     c_garlic   393.719714   3.807277e-73  Significant
...    ...          ...          ...            ...          ...
1088  2025  c_meat_pork    44.757215   2.646148e-04  Significant
1089  2025     c_onions   101.512409   4.662566e-14  Significant
1090  2025   c_potatoes    90.632301   4.683642e-12  Significant
1091  2025       c_rice    67.209085   6.484158e-08  Significant
1092  2025   c_tomatoes    70.027332   2.133517e-08  Significant

[228 rows x 5 columns]
Code
# Remove the 'c_' prefix from the food items
hyp_df_filtered.loc[:, 'Food_Item'] = hyp_df_filtered['Food_Item'].str.replace('^c_', '', regex=True)
print(hyp_df_filtered)
      Year  Food_Item  H-statistic        p-value       Result
1     2007      beans  1077.368029  2.491117e-218  Significant
2     2007    cabbage  1105.762042  2.066749e-224  Significant
3     2007    carrots  1093.290901  9.693593e-222  Significant
4     2007       eggs   905.101619  1.726764e-181  Significant
5     2007     garlic   393.719714   3.807277e-73  Significant
...    ...        ...          ...            ...          ...
1088  2025  meat_pork    44.757215   2.646148e-04  Significant
1089  2025     onions   101.512409   4.662566e-14  Significant
1090  2025   potatoes    90.632301   4.683642e-12  Significant
1091  2025       rice    67.209085   6.484158e-08  Significant
1092  2025   tomatoes    70.027332   2.133517e-08  Significant

[228 rows x 5 columns]
Code
year_dropdown = widgets.Dropdown(
    options=hyp_df_filtered["Year"].unique(),
    value=hyp_df_filtered["Year"].min(),
    description="Select Year:",
    style={'description_width': 'initial'}
)

# Output widget to display the table
output = widgets.Output()

def update_table(change):
    with output:
        output.clear_output()
        display(hyp_df_filtered[hyp_df_filtered["Year"] == year_dropdown.value])

# Trigger update on change
year_dropdown.observe(update_table, names="value")

# Display widgets
display(year_dropdown, output)

# Initial Table Display
update_table(None)
Code
# Or simply count the occurrences of each Result
hyp_df_filtered["Result"].value_counts()
count
Result

This means that for every item, in every year, prices differ significantly across at least some regions. But which regions have different prices?

Code
# Define the available items and years for the dropdowns
items = df_range_filtered['Food_Items'].unique().tolist()
years = df_range_filtered['year'].unique().tolist()

# Step 1: Create the function that will update the graph based on selected item and year
def update_heatmap(selected_item, selected_year):
    # Filter for the selected item and year
    filter_df = df_range_filtered[(df_range_filtered['Food_Items'] == selected_item) & (df_range_filtered['year'] == selected_year)]
    
    # Perform Dunn's Test to check which regions differ
    regions = filter_df['Region'].unique()
    dunn_result = sp.posthoc_dunn(filter_df, val_col='Price', group_col='Region')

    # Prepare heatmap data (p-values matrix)
    p_values_matrix = np.zeros((len(regions), len(regions)))

    for i, region1 in enumerate(regions):
        for j, region2 in enumerate(regions):
            if region1 != region2:
                p_value = dunn_result.loc[region1, region2]
                p_values_matrix[i, j] = p_value
            else:
                p_values_matrix[i, j] = np.nan  # No need for comparison within the same region

    # Create a DataFrame for the heatmap using the regions as both rows and columns
    heatmap_df = pd.DataFrame(p_values_matrix, columns=regions, index=regions)

    # Replace the NaN diagonal with 1.0 (a region never differs significantly from itself)
    heatmap_df = heatmap_df.fillna(1.0)

    # Custom colorscale: green for 1 and red for 0
    custom_colorscale = [
        [0, "darkred"], # 0 -> red
        [0.166, "red"],
        [0.332, "lightcoral"],
        [0.5, "yellow"],
        [0.66, "palegreen"],
        [0.832, "green"],
        [1, "darkgreen"]   # 1 -> green  
    ]

    # Create the heatmap figure with custom colorscale
    fig = go.Figure(data=go.Heatmap(
        z=heatmap_df.values,
        x=heatmap_df.columns,
        y=heatmap_df.index,
        colorscale=custom_colorscale,
        colorbar=dict(title='p-value'),
        zmin=0, zmax=1  # Normalize the range from 0 to 1
    ))

    # Update layout to improve axis labels and graph size
    fig.update_layout(
        title=f"Dunn's Test p-values for {selected_item} in {selected_year} by Region",
        xaxis_title='Region',
        yaxis_title='Region',
        autosize=False,  # Disable autosize
        width=800,  # Set width of the figure
        height=800,  # Set height of the figure
        margin=dict(l=100, r=100, t=100, b=100),  # Increase margins for better visibility
        xaxis=dict(tickmode='array', tickvals=heatmap_df.columns, ticktext=heatmap_df.columns, tickangle=45),  # Rotate x-axis labels for better readability
        yaxis=dict(tickmode='array', tickvals=heatmap_df.index, ticktext=heatmap_df.index, tickangle=0)  # Rotate y-axis labels for better readability
    )

    return fig

# Step 2: Create the Dash app
appDunns = dash.Dash(__name__)

# Define the layout with dropdowns and the graph
appDunns.layout = html.Div([
    html.H3("Price Comparison Heatmap by Region"),
    
    # Dropdown for selecting item
    html.Label('Select Item:'),
    dcc.Dropdown(
        id='item-dropdown',
        options=[{'label': item, 'value': item} for item in items],
        value=items[0],  # default value
        style={'width': '50%'}
    ),
    
    # Dropdown for selecting year
    html.Label('Select Year:'),
    dcc.Dropdown(
        id='year-dropdown',
        options=[{'label': year, 'value': year} for year in years],
        value=years[0],  # default value
        style={'width': '50%'}
    ),
    
    # Graph for showing the heatmap
    dcc.Graph(id='heatmap-graph')
])

# Step 3: Create callback to update the heatmap based on dropdown selection
@appDunns.callback(
    dash.dependencies.Output('heatmap-graph', 'figure'),
    [dash.dependencies.Input('item-dropdown', 'value'),
     dash.dependencies.Input('year-dropdown', 'value')]
)
def update_graph(selected_item, selected_year):
    return update_heatmap(selected_item, selected_year)

# Run the app
if __name__ == '__main__':
    appDunns.run_server(debug=True, port=8054, mode='inline', name="appDunns")

This heatmap visualizes the results of a Dunn’s post-hoc test for significant differences in prices of different food items between regions across the years.

The color scale indicates the following:

  * Red (close to 0): a statistically significant difference in food prices between two regions.
  * Yellow to light green (roughly 0.2 to 0.8): a possible difference, but not statistically significant at the usual 0.05 level.
  * Dark green (close to 1): no statistically significant difference in food prices between the two regions.

Key Observations

  1. Significant Differences Exist: The many red cells indicate statistically significant differences in food prices between almost all regions in any given year.
  2. No Significant Differences: Patches of green indicate pairs of regions where the Dunn’s test did not find a statistically significant difference in food prices, but these are relatively few.
  3. Market Average Comparison: The “Market Average” row/column shows how each region’s prices compare to the overall average; dark red cells there indicate regions whose prices differ significantly from the market average.
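Beyond eyeballing the heatmap, the significant pairs can be read straight out of the p-value matrix; a minimal sketch with a hypothetical 3x3 matrix (region names and p-values made up):

```python
import pandas as pd

regions = ["NCR", "Region I", "Region II"]
# Hypothetical symmetric matrix of Dunn's test p-values
pvals = pd.DataFrame(
    [[1.00, 0.001, 0.30],
     [0.001, 1.00, 0.02],
     [0.30, 0.02, 1.00]],
    index=regions, columns=regions,
)

alpha = 0.05
# Walk the upper triangle and keep pairs below the significance level
pairs = [
    (r1, r2, pvals.loc[r1, r2])
    for i, r1 in enumerate(regions)
    for r2 in regions[i + 1:]
    if pvals.loc[r1, r2] < alpha
]
```

Here that leaves NCR vs Region I and Region I vs Region II as the significantly different pairs.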

Food Price Prediction 3 Years into the Future

Code
#dropping unneeded columns
df_unneeded_range = df_cleaned_range.drop(columns=['country', 'City', 'lat', 'lon', 'Province', 'year', 'month'])

# Reshaping from wide to long format (keeping Region and Date as identifiers)
df_range = df_unneeded_range.melt(id_vars=['Region', 'Date'], var_name='Food_Items', value_name='Price')

df_range_filtered = df_range[df_range['Food_Items'].str.startswith('c_')]

df_range_filtered.loc[:, 'Food_Items'] = df_range_filtered['Food_Items'].str.replace('^c_', '', regex=True)

# Sort by date (note: this reuses the name `df`, replacing the summary table loaded earlier)
df = df_range_filtered.sort_values(by=['Region', 'Date'])
Code
# Suppress specific warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")
warnings.filterwarnings("ignore", category=FutureWarning, module="pmdarima")
warnings.filterwarnings("ignore", category=DeprecationWarning, module="pmdarima")


# Convert date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Sort by date
df = df.sort_values(by=['Region', 'Date'])

# Dropdown for interactive visualization
regions = df['Region'].unique()
items = df['Food_Items'].unique()

region_dropdown = widgets.Dropdown(options=regions, description='Region:')
item_dropdown = widgets.Dropdown(options=items, description='Food Item:')

def preprocess_data(region, item):
    region_df = df[(df['Region'] == region) & (df['Food_Items'] == item)]
    region_df = (
        region_df.groupby('Date', as_index=False)['Price']
        .mean()  # Aggregate duplicate dates by averaging
    )
    region_df = region_df.set_index('Date').asfreq('MS')  # Ensure monthly frequency
    region_df['Price'] = region_df['Price'].interpolate()  # Fill missing values
    return region_df

def update_graph(region, item):
    region_df = preprocess_data(region, item)

    if len(region_df) < 12:
        print(f"Not enough data for {region} - {item} to forecast.")
        return

    # Use AutoARIMA to find the best model
    model = auto_arima(region_df['Price'], seasonal=True, m=12, stepwise=True, trace=True, max_order=None, stationary=False)
    
    # Forecast the next 36 months
    forecast = model.predict(n_periods=36)

    # Plot the historical data and forecast
    plt.figure(figsize=(12, 6))
    plt.plot(region_df.index, region_df['Price'], label='Historical Prices')
    plt.plot(pd.date_range(region_df.index[-1], periods=37, freq='MS')[1:], forecast, label='Forecast', linestyle='dashed', color='red')
    plt.xlabel('Date')
    plt.ylabel('Price')
    plt.legend()
    plt.title(f'Price Forecast for {region} - {item} (Next 36 Months)')
    plt.show()

interactive_plot = widgets.interactive(update_graph, region=region_dropdown, item=item_dropdown)
display(interactive_plot)
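The gap-filling inside preprocess_data (monthly reindexing with asfreq('MS'), then linear interpolation) can be illustrated in isolation on a toy series with one missing month:

```python
import pandas as pd

# Toy price series with no row for 2020-02
s = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-04-01"]),
    "Price": [100.0, 110.0, 120.0],
}).set_index("Date")

monthly = s.asfreq("MS")                           # inserts 2020-02-01 as NaN
monthly["Price"] = monthly["Price"].interpolate()  # linear fill: 2020-02 becomes 105.0
```

This matters for the ARIMA step: auto_arima expects a regular, gap-free time index.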
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[223], line 30, in update_graph(selected_food_item='cabbage')
     26 filtered_dfr = df_range_filtered[df_range_filtered['Food_Items'] == selected_food_item]
     28 # Find the highest price per year and the corresponding region
     29 highest_prices = (
---> 30     filtered_dfr.loc[filtered_dfr.groupby('year')['Price'].idxmax()]
...

(The stale highest-price Dash callback re-ran after the cells above re-melted df_range_filtered with 'Date' instead of 'year', so groupby('year') no longer finds its column and raises a KeyError.)
473057                    Market Average 2024-11-01    cabbage  95.69
473058                    Market Average 2024-12-01    cabbage  87.56
473059                    Market Average 2025-01-01    cabbage  87.43

[23653 rows x 4 columns]
        grouper = None
        grouper, exclusions, obj = (None, None,                                   Region       Date Food_Items  Price
449407  Cordillera Administrative region 2007-01-01    cabbage  26.26
449408  Cordillera Administrative region 2007-02-01    cabbage  24.80
449409  Cordillera Administrative region 2007-03-01    cabbage  24.33
449410  Cordillera Administrative region 2007-04-01    cabbage  24.36
449411  Cordillera Administrative region 2007-05-01    cabbage  24.68
...                                  ...        ...        ...    ...
473055                    Market Average 2024-09-01    cabbage  83.99
473056                    Market Average 2024-10-01    cabbage  87.18
473057                    Market Average 2024-11-01    cabbage  95.69
473058                    Market Average 2024-12-01    cabbage  87.56
473059                    Market Average 2025-01-01    cabbage  87.43

[23653 rows x 4 columns])
        exclusions = None
        keys = 'year'
        axis = 0
        level = None
        sort = True
        lib.no_default = <no_default>
        observed is lib.no_default = True
        observed = <no_default>
        lib = <module 'pandas._libs.lib' from 'c:\\Users\\Admin\\Desktop\\personal_projects\\Data_Analysis\\.venv\\Lib\\site-packages\\pandas\\_libs\\lib.cp312-win_amd64.pyd'>
        self.dropna = True
        self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024004023A10>
   1330         obj,
   1331         keys,
   1332         axis=axis,
   1333         level=level,
   1334         sort=sort,
   1335         observed=False if observed is lib.no_default else observed,
   1336         dropna=self.dropna,
   1337     )
   1339 if observed is lib.no_default:
   1340     if any(ping._passed_categorical for ping in grouper.groupings):

File c:\Users\Admin\Desktop\personal_projects\Data_Analysis\.venv\Lib\site-packages\pandas\core\groupby\grouper.py:1043, in get_grouper(
    obj=                                  Region       D...01-01    cabbage  87.43

[23653 rows x 4 columns],
    key='year',
    axis=0,
    level=None,
    sort=True,
    observed=False,
    validate=True,
    dropna=True
)
   1041         in_axis, level, gpr = False, gpr, None
   1042     else:
-> 1043         raise KeyError(gpr)
        gpr = 'year'
   1044 elif isinstance(gpr, Grouper) and gpr.key is not None:
   1045     # Add key to exclusions
   1046     exclusions.add(gpr.key)

KeyError: 'year'
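A minimal, self-contained reproduction of the fix, using a small illustrative DataFrame rather than the real dataset (the column names mirror the traceback; the `idxmax`-based selection of the highest price per year is an assumption about what `highest_prices` computes):

```python
import pandas as pd

# Illustrative data with the same columns as the filtered cabbage frame.
df = pd.DataFrame({
    "Region": ["CAR", "CAR", "Market Average", "Market Average"],
    "Date": ["2007-01-01", "2008-06-01", "2007-03-01", "2008-09-01"],
    "Food_Items": ["cabbage"] * 4,
    "Price": [26.26, 30.10, 24.33, 35.50],
})

# The frame has Date but no year column; derive it before grouping,
# otherwise df.groupby('year') raises KeyError: 'year'.
df["Date"] = pd.to_datetime(df["Date"])
df["year"] = df["Date"].dt.year

# Row with the highest price in each year (one way to build highest_prices).
highest_prices = (
    df.loc[df.groupby("year")["Price"].idxmax()]
      .reset_index(drop=True)
)
print(highest_prices[["year", "Region", "Price"]])
```

With `year` in place, the `px.bar` call from the failing cell can be rerun unchanged.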

Key Observations

  1. Historical Price Volatility: The blue line representing historical prices shows pronounced fluctuations, most notably a spike around 2008 and smaller ones in later years, indicating that food prices have been subject to considerable volatility over the past two decades.
  2. Recent Price Increase: There is a clear upward trend in food prices in the period leading up to the forecast (roughly 2020 to 2024), suggesting that recent factors have been pushing prices higher.
  3. Relatively Stable Forecast: The red dashed line, representing the 36-month forecast, follows a smooth, slightly upward trajectory; the model predicts a continued but moderated increase compared to the recent past.
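The "recent price increase" observation can be quantified as a year-over-year change in the annual average price. A short sketch on hypothetical monthly prices (not the real dataset) showing the computation:

```python
import pandas as pd

# Hypothetical monthly prices rising steadily from 2020 through 2024,
# standing in for the series behind the observed upward trend.
dates = pd.date_range("2020-01-01", "2024-12-01", freq="MS")
prices = pd.Series(range(len(dates)), index=dates) * 0.5 + 60.0

annual_mean = prices.groupby(prices.index.year).mean()  # average price per year
yoy_change = annual_mean.pct_change() * 100             # year-over-year % change
print(yoy_change.round(1))
```

Applied to the real data, consistently positive year-over-year changes from 2020 onward would confirm the trend the chart suggests.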

Possible Explanations

  1. Historical Volatility:
    • Weather Patterns: Adverse weather events in key agricultural regions can significantly impact supply and cause price volatility.
    • Changes in Demand: Fluctuations in global demand, driven by population growth or economic factors, can also contribute to price swings.
  2. Recent Price Increase:
    • Inflation: General inflation in the economy could be pushing up agricultural commodity prices.
    • Supply Chain Issues: Disruptions to global supply chains (e.g., due to pandemics, conflicts, or trade policies) can lead to shortages and price increases.
    • Increased Input Costs: Rising costs of fertilizers, fuel, and labor can increase the cost of food production, which could be passed on to consumers.
  3. Relatively Stable Forecast:
    • Model Assumption: ARIMA models assume that, to some extent, future trends will resemble past patterns. The stable forecast suggests the model has not picked up any strong upward or downward pressures for the next 36 months beyond what is already captured in the recent trend.
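The "future resembles the past" assumption can be illustrated with the simplest model in the ARIMA family: a random walk with drift, i.e. ARIMA(0,1,0) plus a constant. A minimal numpy-only sketch on a hypothetical series (the real notebook uses `auto_arima` from pmdarima; this is a simplified stand-in to show why such forecasts look stable):

```python
import numpy as np

# Hypothetical monthly price history: a noisy upward trend,
# standing in for the series behind the blue line.
rng = np.random.default_rng(0)
history = 60.0 + 0.4 * np.arange(120) + rng.normal(0, 2, 120)

# Random walk with drift (ARIMA(0,1,0) + constant): the forecast simply
# extends the average historical month-to-month change. This is why the
# forecast line looks smooth and stable — it projects past behaviour
# forward rather than anticipating new shocks.
drift = np.mean(np.diff(history))   # average monthly change
steps = np.arange(1, 37)            # 36-month horizon
forecast = history[-1] + drift * steps

print(f"estimated drift: {drift:.3f} per month")
```

A full ARIMA fit adds autoregressive and moving-average terms, but the core behaviour is the same: the forecast smooths out the noise and continues the recent trend.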